fault tolerance
Fault-Tolerant MARL for CAVs under Observation Perturbations for Highway On-Ramp Merging
Shi, Yuchen, Pei, Huaxin, Zhang, Yi, Yao, Danya
Multi-Agent Reinforcement Learning (MARL) holds significant promise for enabling cooperative driving among Connected and Automated Vehicles (CAVs). However, its practical application is hindered by a critical limitation, i.e., insufficient fault tolerance against observational faults. Such faults, which appear as perturbations in the vehicles' perceived data, can substantially compromise the performance of MARL-based driving systems. Addressing this problem presents two primary challenges. One is to generate adversarial perturbations that effectively stress the policy during training, and the other is to equip vehicles with the capability to mitigate the impact of corrupted observations. To overcome the challenges, we propose a fault-tolerant MARL method for cooperative on-ramp vehicles incorporating two key agents. First, an adversarial fault injection agent is co-trained to generate perturbations that actively challenge and harden the vehicle policies. Second, we design a novel fault-tolerant vehicle agent equipped with a self-diagnosis capability, which leverages the inherent spatio-temporal correlations in vehicle state sequences to detect faults and reconstruct credible observations, thereby shielding the policy from misleading inputs. Experiments in a simulated highway merging scenario demonstrate that our method significantly outperforms baseline MARL approaches, achieving near-fault-free levels of safety and efficiency under various observation fault patterns.
- Asia > China > Beijing > Beijing (0.04)
- North America > United States (0.04)
- Transportation > Infrastructure & Services (1.00)
- Transportation > Ground > Road (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (3 more...)
FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization
Liu, Fangxin, Wang, Zongwu, Xia, JinHong, Zhao, Junping, Zhao, Shouren, Li, Jinjin, Liu, Jian, Jiang, Li, Guan, Haibing
The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce memory overhead, existing methods predominantly rely on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths during each token generation. FlexQuant provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management. Evaluations demonstrate that FlexQuant achieves a 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss introduced. This framework offers a flexible and adaptive solution for efficient LLM deployment. Code is released at https://github.com/ZongwuWang/FlexQuant.git.
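The abstract names Kullback-Leibler divergence as one of the signals behind precision switching but does not say how it is applied. A minimal sketch of one plausible reading follows: compare the next-token distribution of a full-precision reference with that of each quantized variant, and keep the lowest bit-width whose divergence stays under a threshold. The function names, the threshold, and the example distributions are assumptions made for illustration, not FlexQuant's actual procedure.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def pick_bit_width(reference, candidates, threshold):
    """Return the smallest bit-width whose output stays close to the reference.

    candidates maps a bit-width to the (hypothetical) next-token distribution
    produced at that precision. Falls back to the widest option if none pass.
    """
    for bits in sorted(candidates):
        if kl_divergence(reference, candidates[bits]) <= threshold:
            return bits
    return max(candidates)

# Made-up distributions over a three-token vocabulary.
reference = [0.7, 0.2, 0.1]
candidates = {
    4: [0.5, 0.3, 0.2],     # visibly distorted by aggressive quantization
    8: [0.69, 0.21, 0.10],  # nearly identical to the reference
}
chosen = pick_bit_width(reference, candidates, threshold=0.01)
print(chosen)
```

With these made-up numbers the 4-bit distribution diverges too far from the reference, so the sketch falls back to 8 bits.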
Anticipating Degradation: A Predictive Approach to Fault Tolerance in Robot Swarms
An active approach to fault tolerance is essential for robot swarms to achieve long-term autonomy. Previous efforts have focused on responding to spontaneous electro-mechanical faults and failures. However, many faults occur gradually over time. This work argues that the principles of predictive maintenance, in which potential faults are resolved before they hinder the operation of the swarm, offer a promising means of achieving long-term fault tolerance. This is a novel approach to swarm fault tolerance, which is shown to give a comparable or improved performance when tested against a reactive approach in almost all cases tested. However, a significant barrier to the deployment of autonomous robots in many real-world applications is the risk of failure or loss of autonomous control in the field.
- North America > United States > Tennessee > Anderson County > Oak Ridge (0.04)
- North America > United States > North Carolina > Wake County > Raleigh (0.04)
- North America > Canada (0.04)
- Asia > Middle East > Israel (0.04)
A Three-Level Whole-Body Disturbance Rejection Control Framework for Dynamic Motions in Legged Robots
Li, Bolin, Zuo, Gewei, Wang, Zhixiang, Ke, Xiaotian, Zhu, Lijun, Ding, Han
Abstract--This paper presents a control framework designed to enhance the stability and robustness of legged robots in the presence of uncertainties, including model uncertainties, external disturbances, and faults. The framework enables the full-state feedback estimator to estimate and compensate for uncertainties in the whole-body dynamics of the legged robots. First, we propose a novel moving horizon extended state observer (MH-ESO) to estimate uncertainties and mitigate noise in legged systems, which can be integrated into the framework for disturbance compensation. Second, we introduce a three-level whole-body disturbance rejection control framework (T-WB-DRC). Unlike the previous two-level approach, this three-level framework considers both the plan based on whole-body dynamics without uncertainties and the plan based on dynamics with uncertainties, significantly improving payload transportation, external disturbance rejection, and fault tolerance. Third, simulations of both humanoid and quadruped robots in the Gazebo simulator demonstrate the effectiveness and versatility of T-WB-DRC. Note to Practitioners--This paper presents a practical control framework to significantly improve the robustness of legged robots against real-world uncertainties like unknown payloads, external pushes, and actuator faults. Its core is a novel three-level whole-body controller (T-WB-DRC) that uses a moving horizon estimator (MH-ESO) to accurately identify and compensate for disturbances in real-time. This dual-planning approach, which considers both ideal and disturbance-injected dynamics, outperforms previous methods. The framework's effectiveness in enhancing stability under disturbances has been successfully validated through extensive simulations and physical experiments on a quadruped robot.
- Asia > China > Hubei Province > Wuhan (0.05)
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)
- (5 more...)
FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention
Dai, Huangliang, Wu, Shixun, Huang, Jiajun, Jian, Zizhe, Zhu, Yue, Hu, Haiyang, Chen, Zizhong
Transformer models rely on High-Performance Computing (HPC) resources for inference, where soft errors are inevitable in large-scale systems, making the reliability of the model particularly critical. Existing fault tolerance frameworks for Transformers are designed at the operation level without architectural optimization, leading to significant computational and memory overhead, which in turn reduces protection efficiency and limits scalability to larger models. In this paper, we implement module-level protection for Transformers by treating the operations within the attention module as a single kernel and applying end-to-end fault tolerance. This method provides unified protection across multi-step computations, while achieving comprehensive coverage of potential errors in the nonlinear computations. For linear modules, we design a strided algorithm-based fault tolerance (ABFT) that avoids inter-thread communication. Experimental results show that our end-to-end fault tolerance achieves up to 7.56x speedup over traditional methods with an average fault tolerance overhead of 13.9%.
- Europe > Austria > Vienna (0.14)
- North America > United States > California > Riverside County > Riverside (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Architecture (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
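For readers unfamiliar with algorithm-based fault tolerance (ABFT), mentioned in the FT-Transformer abstract above, the following is a minimal sketch of the classic checksum check for a matrix product. It relies on the identity 1^T(AB) = (1^T A)B. This is the textbook scheme, not the strided variant the paper proposes, and the helper names are assumptions made for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def column_checksum(m):
    """Sum each column of m, i.e. the row vector 1^T m."""
    return [sum(row[j] for row in m) for j in range(len(m[0]))]

def abft_check(a, b, c, tol=1e-9):
    """Return True if c is consistent with a @ b.

    The column sums of C must equal (column checksum of A) @ B, which
    costs one extra vector-matrix product instead of a full recompute.
    """
    expected = matmul([column_checksum(a)], b)[0]
    actual = column_checksum(c)
    return all(abs(e - x) <= tol * max(1.0, abs(e))
               for e, x in zip(expected, actual))

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul(a, b)
clean_ok = abft_check(a, b, c)

# Simulate a soft error by corrupting one element of the result.
c_faulty = [row[:] for row in c]
c_faulty[1][0] += 0.5
faulty_ok = abft_check(a, b, c_faulty)

print(clean_ok, faulty_ok)
```

The check passes on the clean product and fails once a single output element is corrupted.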
Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers
Titopoulos, Vasileios, Alexandridis, Kosmas, Dimitrakopoulos, Giorgos
Transformers and large language models (LLMs), powered by the attention mechanism, have transformed numerous AI applications, driving the need for specialized hardware accelerators. A major challenge in these accelerators is efficiently detecting errors caused by random hardware faults. Traditional algorithm-based fault tolerance (ABFT) techniques verify individual matrix multiplications but fall short in handling the full attention mechanism, particularly due to intermediate softmax normalization. This work proposes Flash-ABFT, a novel method that computes an online checksum across the entire three-matrix product of query, key and value matrices, of an attention layer, including the softmax operation, with a single check. This approach significantly reduces overhead by eliminating redundant checks while maintaining high fault-detection accuracy. Experimental results demonstrate that Flash-ABFT incurs only 5.3% hardware area overhead and less than 1.9% energy overhead, making it a cost-effective and robust solution for error detection in attention accelerators.
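To illustrate what a single check over an attention output can look like, the sketch below uses the identity 1^T (S V) 1 = (1^T S)(V 1), where S is the row-wise softmax of the scaled query-key scores. It is not the paper's online Flash-ABFT algorithm: here the softmax values are shared by both sides of the comparison, so only faults in the S V product are caught. All function names and inputs are assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax of one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head scaled dot-product attention; returns (scores, output)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [softmax([scale * sum(x * y for x, y in zip(qr, kr)) for kr in k])
              for qr in q]
    out = [[sum(s[j] * v[j][d] for j in range(len(v)))
            for d in range(len(v[0]))]
           for s in scores]
    return scores, out

def predicted_checksum(scores, v):
    """(1^T S)(V 1): column sums of S dotted with row sums of V."""
    col_sums = [sum(s[j] for s in scores) for j in range(len(v))]
    row_sums = [sum(row) for row in v]
    return sum(c * r for c, r in zip(col_sums, row_sums))

def output_checksum(out):
    """Sum of every element of the attention output."""
    return sum(sum(row) for row in out)

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

scores, out = attention(q, k, v)
clean_gap = abs(predicted_checksum(scores, v) - output_checksum(out))

# Corrupt one output element to mimic a hardware fault.
out[0][1] += 0.25
faulty_gap = abs(predicted_checksum(scores, v) - output_checksum(out))

print(clean_gap < 1e-9, faulty_gap > 1e-3)
```

A hardware implementation would compute the predicted checksum in a separate, cheaper data path; the Python version only shows that one scalar comparison can cover the whole output matrix.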
Fault-Tolerant Multi-Robot Coordination with Limited Sensing within Confined Environments
Aina, Kehinde O., Bagheri, Hosain, Goldman, Daniel I.
As robots are increasingly deployed to collaborate on tasks within shared workspaces and resources, the failure of an individual robot can critically affect the group's performance. This issue is particularly challenging when robots lack global information or direct communication, relying instead on social interaction for coordination and to complete their tasks. In this study, we propose a novel fault-tolerance technique leveraging physical contact interactions in multi-robot systems, specifically under conditions of limited sensing and spatial confinement. We introduce the "Active Contact Response" (ACR) method, where each robot modulates its behavior based on the likelihood of encountering an inoperative (faulty) robot. Active robots are capable of collectively repositioning stationary and faulty peers to reduce obstructions and maintain optimal group functionality. We implement our algorithm in a team of autonomous robots, equipped with contact-sensing and collision-tolerance capabilities, tasked with collectively excavating cohesive model pellets. Experimental results indicate that the ACR method significantly improves the system's recovery time from robot failures, enabling continued collective excavation with minimal performance degradation. Thus, this work demonstrates the potential of leveraging local, social, and physical interactions to enhance fault tolerance and coordination in multi-robot systems operating in constrained and extreme environments.